Work stealing for GPU-accelerated parallel programs in a global address space framework
Abstract
Task parallelism is an attractive approach to automatically load balancing computation in a parallel system and adapting to the dynamism such systems exhibit. Exploiting task parallelism through work stealing has been extensively studied in shared- and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work-stealing algorithm for CPU-GPU systems, weighing the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations, as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain. Copyright © 2015 John Wiley & Sons, Ltd.
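The central mechanism the abstract refers to is the work-stealing scheduler: each worker keeps a private deque of tasks, the owner pushes and pops work at one end, and idle workers steal from the other end of a random victim's deque. Below is a minimal shared-memory sketch of that discipline in C++. It is illustrative only: the names (StealingPool, Worker, submit) are assumptions, and it deliberately omits the distributed-memory, PGAS, and GPU data-movement aspects that the paper actually addresses.

// Minimal work-stealing sketch (not the paper's implementation).
// Owners pop newest work (LIFO, bottom of deque); thieves steal
// oldest work (FIFO, top of deque) from a random victim.
#include <atomic>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

using Task = std::function<void()>;

struct Worker {
    std::deque<Task> deque;  // this worker's private task deque
    std::mutex lock;         // simplification: a real deque would be lock-free
};

class StealingPool {
public:
    explicit StealingPool(unsigned n) : workers_(n), pending_(0) {}

    // Enqueue a task onto worker w's deque before the pool starts.
    void submit(unsigned w, Task t) {
        std::lock_guard<std::mutex> g(workers_[w].lock);
        workers_[w].deque.push_back(std::move(t));
        pending_.fetch_add(1);
    }

    // Run one thread per worker until every submitted task has executed.
    void run() {
        std::vector<std::thread> threads;
        for (unsigned id = 0; id < workers_.size(); ++id)
            threads.emplace_back([this, id] { work(id); });
        for (auto& t : threads) t.join();
    }

private:
    bool pop_local(unsigned id, Task& t) {
        auto& w = workers_[id];
        std::lock_guard<std::mutex> g(w.lock);
        if (w.deque.empty()) return false;
        t = std::move(w.deque.back());  // owner takes newest task (LIFO)
        w.deque.pop_back();
        return true;
    }

    bool steal(unsigned thief, Task& t) {
        static thread_local std::mt19937 rng(std::random_device{}());
        unsigned victim = rng() % workers_.size();  // random victim selection
        if (victim == thief) return false;
        auto& w = workers_[victim];
        std::lock_guard<std::mutex> g(w.lock);
        if (w.deque.empty()) return false;
        t = std::move(w.deque.front());  // thief takes oldest task (FIFO)
        w.deque.pop_front();
        return true;
    }

    void work(unsigned id) {
        Task t;
        while (pending_.load() > 0) {
            if (pop_local(id, t) || steal(id, t)) {
                t();
                pending_.fetch_sub(1);
            } else {
                std::this_thread::yield();  // no work found: retry stealing
            }
        }
    }

    std::vector<Worker> workers_;
    std::atomic<long> pending_;  // simple termination: count of unfinished tasks
};

int main() {
    StealingPool pool(4);
    // Deliberately skewed load: all tasks start on worker 0, so the other
    // three workers obtain work only by stealing.
    for (int i = 0; i < 16; ++i)
        pool.submit(0, [i] { std::printf("task %d\n", i); });
    pool.run();
}

Owners pop LIFO for cache locality while thieves steal FIFO, taking the oldest and typically largest remaining piece of work. On the hybrid CPU-GPU systems the paper targets, the same idea is complicated by the fact that a stolen task may need its data moved across the host-device or internode boundary, which is precisely the overhead the paper's design alternatives weigh.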
Related papers
Hierarchical Work Stealing on Manycore Clusters
Partitioned Global Address Space languages like UPC offer a convenient way of expressing large shared data structures, especially irregular structures that require asynchronous random access. But the static SPMD parallelism model of UPC does not support divide-and-conquer parallelism or other forms of dynamic parallelism. We introduce a dynamic tasking library for UPC that provides a simple...
Large-scale genome-wide association studies on a GPU cluster using a CUDA-accelerated PGAS programming model
Detecting epistasis, such as 2-SNP interactions, in Genome-Wide Association Studies (GWAS) is an important but time-consuming operation. Consequently, GPUs have already been used to accelerate these studies, reducing the runtime for moderately sized datasets to less than one hour. However, single-GPU approaches cannot perform large-scale GWAS in reasonable time. In this work we present multiEpi...
Resolutions of the Coulomb operator: VIII. Parallel implementation using the modern programming language X10
Use of the modern parallel programming language X10 for computing long-range Coulomb and exchange interactions is presented. By using X10, a partitioned global address space language with support for task parallelism and the explicit representation of data locality, the resolution of the Ewald operator can be parallelized in a straightforward manner including use of both intranode and internode...
A Cross-Input Adaptive Framework for GPU Programs Optimization
Recent years have seen a trend in using graphics processing units (GPU) as accelerators for general-purpose computing. The inexpensive, single-chip, massively parallel architecture of the GPU has evidently brought factors of speedup to many numerical applications. However, the development of a high-quality GPU application is challenging, owing to the large optimization space and complex unpredict...
Optimizing Partitioned Global Address Space Programs for Cluster Architectures
Optimizing Partitioned Global Address Space Programs for Cluster Architectures, by Wei-Yu Chen; Doctor of Philosophy in Computer Science, University of California, Berkeley; Professor Katherine A. Yelick, Chair. Unified Parallel C (UPC) is an example of a partitioned global address space language for high-performance parallel computing. This programming model enables applications to be written in a s...
Journal: Concurrency and Computation: Practice and Experience
Volume: 28, Issue: -
Pages: -
Publication year: 2016